ABSTRACT

An Educational Testing Service (ETS) procedure that is based on
item response theory and estimates true scores on tests not taken
was evaluated. The reading, vocabulary, and mathematics
tests of high school seniors from the National Longitudinal Study 
(NLS) of 1972 and the High School and Beyond (HSB) seniors of 1980 
and 1982 were found to share common blocks of items but differ in 
terms of total test length and overall difficulty. As the common 
blocks were too short to permit reliable comparisons, LOGIST was used 
to simultaneously estimate the item and person parameters for all 
three cohorts. The estimate of a student's ability (theta) based on 
the achievement test taken was combined with the item parameters for 
a test not taken. Thus, a number-right true score or "estimated 
number-right score" was obtained for a 1982 HSB senior on a 1972 NLS
mathematics test. Since the group's expected true score equaled its 
expected number-right score, the desired cohort comparisons across 
time could be made. In an evaluation of this method, which generates 
hypothetical true scores based on tests not actually taken, 
comparable test forms (X and Y) of varying common block size were 
simulated for three conditions. Two hundred simulated examinees
were used for each of the three conditions. For each simulated examinee, the
probability of a correct response was computed for each item. 
Comparisons of simulated true scores with true scores based on LOGIST 
estimates were made for both the tests taken and the hypothetical 
tests not taken. Results indicate that the ETS method should remain 
experimental. (TJH) 
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" ■ the-Estiiratl^h^^ for Tests Not Taken 

The purpose of this paper is to evaluate an Educational
Testing Service (ETS) procedure described by Pollack at the 1985
annual AERA meeting. The ETS procedure is based on item response
theory (IRT) and allows the estimation of true scores on tests
not taken. In an effort to investigate the apparent decline in
achievement test scores, ETS researchers compared high school
seniors from the National Longitudinal Study (NLS) of 1972 with
the High School and Beyond (HSB) seniors of 1980 and 1982. The
reading, vocabulary, and mathematics tests shared common blocks
of items but differed with respect to total test length and
overall difficulty. Since the common blocks were too short to
permit reliable comparisons, ETS used LOGIST (1982) to
simultaneously estimate the item and person parameters for all
three cohorts. Thus, the estimate of a student's ability (theta)
based on the achievement test the student had taken could be
combined with the item parameters for a test the student had not
taken. In this way, a number-right true score or "estimated
number-right score" could be obtained for, say, a 1982 HSB senior
on a 1972 NLS mathematics test (Pollack, 1985, p. 10). Since the
group's expected true score is equivalent to its expected
number-right score (Lord & Novick, 1968), the desired cohort
comparisons across time could be made. Clearly, a method which
generates hypothetical true scores based on tests not actually
taken needs to be evaluated.

Method

The author designed a simulation study where LOGIST-based
true score estimates could be compared to simulated true scores
with respect to overall bias, relative variability, and linear
association. Comparisons were made for several alternate forms of
a test, as well as an equivalent form, for each of three
conditions. The model and experimental design are presented
below.
Model:

The three-parameter logistic model specifies the probability
that a person of a given ability ($\theta_k$) will correctly answer a
dichotomously scored item:

$$P_i(\theta_k) = c_i + \frac{1 - c_i}{1 + e^{-1.7 a_i (\theta_k - b_i)}} \tag{1}$$

where $P_i(\theta_k)$ = the probability of a correct response to the ith
            item by the kth person,
      $c_i$ = pseudo-chance score level,
      $e$ = 2.71828 ...,
      $a_i$ = discriminating power of the item, and
      $b_i$ = the item difficulty, a location parameter
relative to the theta scale (Birnbaum in Lord, 1980, p. 12).
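
As a minimal illustration of equation (1), the following Python sketch evaluates the three-parameter logistic probability. The function name `p_correct` and the example values are illustrative assumptions; the original study used SAS and LOGIST, not this code.

```python
import math

def p_correct(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model of equation (1).

    theta: ability; a: discrimination; b: difficulty; c: pseudo-chance level.
    D = 1.7 is the scaling constant that appears in the exponent.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Illustrative values only (a hypothetical item and examinee).
# When theta equals b, the probability lies halfway between c and 1.
p_at_b = p_correct(theta=-0.27, a=1.0, b=-0.27, c=0.2)
print(round(p_at_b, 3))  # 0.6
```

With c = .2, the lower asymptote of the curve is .2 rather than zero, which is why the number-right true scores formed from these probabilities can never fall below .2 times the number of items.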

Some of the major testing companies are utilizing Rasch-like
three-parameter logistic models where the discrimination
parameter 'a' is fixed, the pseudo-guessing parameter 'c' is
fixed, and the item difficulty parameter 'b' is free to vary.
This model is "Rasch-like" since only the b's are varying.
However, once the pseudo-guessing value is specified, both
one-parameter (Rasch) models and two-parameter models are strictly
ruled out.
Simulated Tests:

Comparable test forms (X and Y) of varying common block size
were simulated for three conditions. In each case, n_x = n_y = 20 items,
where a = 1.0 and c = .2 for all items. The following item parameters
were input to the simulation phase and not estimated:



Condition I (75% of the items are shared by tests X and Y):

    -1.5 <= b_x <= 1.0     mean b_x = -.270     SD b_x = .615
    -1.3 <= b_y <= 1.0     mean b_y = -.120     SD b_y = .616

Condition II (50% of the items shared):

    -1.5 <= b_x <= 1.0     mean b_x = -.270     SD b_x = .615
    -1.3 <= b_y <= 1.0     mean b_y = -.148     SD b_y = .623

Condition III (No items shared):

    -1.5 <= b_x <= 1.0     mean b_x = -.270     SD b_x = .615
    -1.3 <= b_y <= 1.0     mean b_y = -.155     SD b_y = .684



Test X is the same across conditions, while test Y differs 
slightly; test X is always somewhat easier than test Y. 

The item parameters are modifications of an array used by 
Yen (1984, p. 96) in a simulation study. 
Simulated Examinees:

Two hundred simulated examinees for each condition were
obtained by using the RANNOR function in SAS (1982), which
randomly generates an observation from a normally distributed
population with mean equal to zero and variance equal to one.
Each set of thetas (examinees) was rescaled to mean zero and
standard deviation one, denoted (0,1), to adjust for sampling
fluctuations. This rescaling is necessary for some
comparisons of simulated thetas vs thetas based on LOGIST
estimates since the theta scale is indeterminate and needs to be
fixed (Lord, 1980, p. 36). By default, LOGIST standardizes theta
to (0,1) in two of the four estimation steps and adjusts the item
parameters accordingly (LOGIST, 1982, p. 15).
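
The sampling and rescaling step can be sketched as follows. This is only an illustration in Python: `random.gauss` stands in for the SAS RANNOR function, the function name `simulate_thetas` and the seed are hypothetical, and the use of the population (divide-by-n) standard deviation in the rescaling is an assumption, since the divisor is not stated.

```python
import random
import statistics

def simulate_thetas(n, seed=None):
    """Draw n abilities from N(0, 1), then rescale to mean 0 and SD 1."""
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean = statistics.fmean(raw)
    sd = statistics.pstdev(raw)  # assumption: population SD (divisor n)
    # Standardizing removes the sampling fluctuation in mean and spread.
    return [(t - mean) / sd for t in raw]

thetas = simulate_thetas(200, seed=1985)  # 200 simulated examinees
print(len(thetas))
```

After rescaling, the sample mean and standard deviation are exactly 0 and 1 up to floating-point error, so simulated and estimated thetas are expressed on the same fixed scale.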
Simulated Response Vectors:

For each simulated examinee, the $P_i(\theta_k)$ was computed for
each item. To generate a response vector for that person over all
items, the SAS (1982) RANUNI function was called. This function
returns a uniform deviate on the interval (0,1). A response was
coded as 1 = correct if the random number from RANUNI was less
than or equal to $P_i(\theta_k)$, and 0 = incorrect otherwise. The
resultant matrix of responses (U) was 200x40. The first 100
examinees were considered to be Group I, the second 100 examinees
Group II. Thus, the matrix of responses could be partitioned
into quadrants (see Figure 1). Quadrants (ii) and (iii)
correspond to the hypothetical tests or the tests not taken and
are later coded as 'not reached' for LOGIST. The input dataset
for LOGIST has missing data in these quadrants and parallels the
case where Group I took test X and Group II took test Y. However,
the data corresponding to the tests not taken are saved from the
simulation phase for eventual comparisons of simulated vs
estimated true scores.



Insert Figure 1 about here.

A matrix of responses was generated for each of the three 
conditions where the relative sizes of the common blocks of items 
were 75%, 50%, and no common block for conditions I, II, and III, 
respectively. 
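
A rough Python sketch of the response-generation step is given below. `random.random` stands in for the SAS RANUNI function. The item difficulties are hypothetical, evenly spaced values within the reported ranges, because the actual arrays are not listed, and all function and variable names are illustrative.

```python
import math
import random

def p_correct(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (equation 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def simulate_responses(thetas, b_values, a=1.0, c=0.2, seed=None):
    """Return a 0/1 response matrix: one row per examinee, one column per item."""
    rng = random.Random(seed)
    U = []
    for theta in thetas:
        # Code 1 (correct) when the uniform deviate is <= the p-value.
        row = [1 if rng.random() <= p_correct(theta, a, b, c) else 0
               for b in b_values]
        U.append(row)
    return U

# Hypothetical inputs: 200 examinees and 20 + 20 item difficulties.
rng = random.Random(0)
thetas = [rng.gauss(0.0, 1.0) for _ in range(200)]
b_x = [-1.5 + 2.5 * i / 19 for i in range(20)]  # test X, -1.5 to 1.0
b_y = [-1.3 + 2.3 * i / 19 for i in range(20)]  # test Y, -1.3 to 1.0
U = simulate_responses(thetas, b_x + b_y, seed=1)

# Partition U into the four blocks of Figure 1.
group1_test_x = [row[:20] for row in U[:100]]  # taken
group1_test_y = [row[20:] for row in U[:100]]  # not taken
group2_test_x = [row[:20] for row in U[100:]]  # not taken
group2_test_y = [row[20:] for row in U[100:]]  # taken
```

The two "not taken" blocks are kept rather than discarded, which is what later allows simulated true scores to be compared with the estimates for tests not taken.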

True Scores for Simulated Tests:

The rows of U within a test were summed to obtain number-right
scores. The p-values ($P_i(\theta_k)$) for each examinee over the
items of a given test were summed to obtain number-right true
scores since

$$\xi_k = \sum_{i=1}^{n} P_i(\theta_k) \tag{2}$$

where $\xi_k$ = the kth person's number-right true score (Lord, 1980,
pp. 45-46).
LOGIST Estimation:

The input data matrix for LOGIST was a transformed U matrix,
where U has been described above. Each simulated examinee's
response vector consisted of 1's, 0's, and 3's where 1=correct,
0=incorrect, and 3=not reached. Thus, an estimated theta is based
on the responses to items on the test taken by the examinee; an
examinee is not penalized for the items coded as 'not reached.'
Although the estimation of item parameters for both tests is
simultaneous and has been referred to by others as "concurrent
calibration" (cf. Petersen, Cook, & Stocking, 1983), items for
test X are, in effect, calibrated on Group I since no one from
Group II took test X, at least as coded for LOGIST. Similarly,
items for test Y are calibrated on Group II. The LOGIST input
dataset (for each condition) is depicted in Figure 2.



Insert Figure 2 about here.
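
The recoding of the response matrix into the layout of Figure 2 might be sketched as below. The sketch only illustrates the 1/0/3 coding scheme on a tiny hypothetical matrix; it does not reproduce the actual LOGIST input file format, and the function name is an assumption.

```python
NOT_REACHED = 3  # code for items on the test an examinee did not take

def code_for_calibration(U, n_group1, n_items_x):
    """Blank out the test not taken: Group I keeps test X, Group II keeps test Y."""
    coded = []
    for k, row in enumerate(U):
        n_items_y = len(row) - n_items_x
        if k < n_group1:
            # Group I: keep test X responses, mark test Y as not reached.
            coded.append(row[:n_items_x] + [NOT_REACHED] * n_items_y)
        else:
            # Group II: mark test X as not reached, keep test Y responses.
            coded.append([NOT_REACHED] * n_items_x + row[n_items_x:])
    return coded

# Tiny hypothetical example: 2 examinees per group, 2 items per test.
U = [[1, 0, 1, 1],
     [0, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 1, 1, 1]]
coded = code_for_calibration(U, n_group1=2, n_items_x=2)
for row in coded:
    print(row)
# [1, 0, 3, 3]
# [0, 0, 3, 3]
# [3, 3, 0, 1]
# [3, 3, 1, 1]
```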



The limits of the theta scale were constrained to be ±3.0
since most practitioners would ordinarily use these limits. Since
b's and thetas needed to be estimated (a's and c's were fixed),
only the first estimation step was invoked. The convergence
criterion was set to the default criterion value normally
associated with the fourth and final step of a full LOGIST run
(LOGIST, 1982, pp. 13-14).

The advantage of inputting the data as described above is
that the simultaneous estimation of person parameters and item
parameters over the two groups results in estimates on a common
scale (Pollack, 1985, p. 9; Hambleton & Swaminathan, 1985, p. 212).

Since LOGIST cannot return finite maximum likelihood
estimates of theta for examinees with perfect scores, it was
necessary to look at the distributions of estimated thetas from
each condition to select an appropriate theta value for these
examinees. Simulating the case where a practitioner would be
forced to assign a theta value, one tenth of a unit was added to
the rounded, largest estimated theta so that the distribution
would remain relatively continuous. These values were then used
in subsequent analyses.
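
The assignment rule for perfect scores can be illustrated with a short Python sketch. Rounding to one decimal place is an assumption, although it reproduces the assigned values of 1.8, 1.9, and 1.8 given in the note to Table 1 from the upper limits of 1.68, 1.84, and 1.74 reported there. The function name is hypothetical.

```python
def theta_for_perfect_scores(estimated_thetas, step=0.1):
    """Round the largest finite estimate to one decimal, then add one tenth."""
    largest = max(estimated_thetas)
    # The outer round() only removes floating-point noise from the addition.
    return round(round(largest, 1) + step, 1)

# Upper limits of the LOGIST estimates reported in Table 1.
print(theta_for_perfect_scores([-3.00, 1.68]))  # Condition I   -> 1.8
print(theta_for_perfect_scores([-3.00, 1.84]))  # Condition II  -> 1.9
print(theta_for_perfect_scores([-3.00, 1.74]))  # Condition III -> 1.8
```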

The person and item parameter estimates were later used to 
obtain p-values and true scores. In order to obtain true scores 
on the tests not taken, the theta estimates from the group of 
interest were combined with the item parameters on the test not 
taken by that group to generate p-values which were then summed. 
Similarly, to obtain true scores on tests taken, the theta
estimates from the group of interest were combined with the item 
parameters on the test taken by that group. 
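
The computation described above can be sketched in Python as follows. The theta estimate and the item parameters are hypothetical values chosen for illustration, not LOGIST output, and the function names are assumptions.

```python
import math

def p_correct(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (equation 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, item_params):
    """Number-right true score: the sum of p-values over a test's items."""
    return sum(p_correct(theta, a, b, c) for (a, b, c) in item_params)

# Hypothetical (a, b, c) triples for short versions of tests X and Y.
items_x = [(1.0, b, 0.2) for b in (-1.5, -0.5, 0.0, 1.0)]
items_y = [(1.0, b, 0.2) for b in (-1.3, -0.2, 0.3, 1.0)]

theta_hat = 0.4  # hypothetical estimate for a Group I examinee (took test X)

score_taken = true_score(theta_hat, items_x)      # test taken
score_not_taken = true_score(theta_hat, items_y)  # hypothetical test not taken
print(round(score_taken, 3), round(score_not_taken, 3))
```

The same theta estimate enters both sums. Only the item parameters change, which is what makes a score on a test not taken computable at all.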

Schematically, the blocks of true scores may be envisioned
as in Figure 3.

Insert Figure 3 about here.

True Score and Theta Comparisons:

Comparisons of simulated true scores with true scores based
on LOGIST estimates were made for both the tests taken and the
hypothetical tests not taken. Several statistics were computed:
(1) the standardized mean difference (where the denominator is a
pooled estimate of the standard deviation) as a measure of
overall bias in that systematic errors of estimation are detected
(Yen, 1984); (2) the ratio of standard deviations to assess the
degree of homogeneity of variability; and (3) Pearson correlation
coefficients to assess the degree of linear association between
the two sets of true scores. Pearson correlations were also
computed to assess the relation between simulated and estimated
thetas.
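
The three statistics might be computed as in the Python sketch below. The pooled standard deviation formula shown (weighting the two sample variances by their degrees of freedom) is an assumption, since the exact pooling formula is not given, and the data are hypothetical.

```python
import math
import statistics

def standardized_mean_difference(simulated, estimated):
    """(mean simulated - mean estimated) / pooled SD; the pooling rule is assumed."""
    n1, n2 = len(simulated), len(estimated)
    v1, v2 = statistics.variance(simulated), statistics.variance(estimated)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.fmean(simulated) - statistics.fmean(estimated)) / pooled_sd

def sd_ratio(simulated, estimated):
    """Ratio of standard deviations, simulated over estimated."""
    return statistics.stdev(simulated) / statistics.stdev(estimated)

def pearson_r(x, y):
    """Pearson product-moment correlation."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical true scores: the estimates are uniformly 0.5 points too high.
simulated = [10.0, 12.0, 14.0, 16.0]
estimated = [10.5, 12.5, 14.5, 16.5]

smd = standardized_mean_difference(simulated, estimated)
ratio = sd_ratio(simulated, estimated)
r = pearson_r(simulated, estimated)
print(round(smd, 3), round(ratio, 3), round(r, 3))  # -0.194 1.0 1.0
```

The example shows why all three statistics are needed: a constant overestimate leaves the SD ratio and the correlation at 1.0, and only the standardized mean difference detects it.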

Results 

LOGIST Estimates of Theta vs Simulated Thetas:

The estimated thetas from LOGIST and the simulated thetas
are distributed as (0,1) over both groups of examinees. While the
limits of the theta scale in LOGIST were constrained to be ±3.0,
the obtained distributions of estimated thetas were truncated at
the upper end. The correlations between the estimated and
simulated thetas were consistently good over all three
conditions, ranging from .868 to .894 (see Table 1).



Insert Table 1 about here.

True Score Comparisons:

The results for the tests taken, as well as the hypothetical
tests not taken, are reported in Table 2.



Insert Table 2 about here. 

Tests Taken : The overall bias as measured by the 
standardized mean difference of the true scores is quite small 
for the tests taken: the bias ranges from approximately zero to 
5% of a pooled standard deviation over test forms and conditions. 
The two sets of true scores are homogeneous with respect to 
variability; and the correlations are consistently good. 

Hypothetical Tests Not Taken: There is a modest indication
of overall bias in the standardized mean difference of the true
scores for the hypothetical tests not taken. The mean true scores
based on LOGIST estimates for test X underestimate the simulated
mean true scores in each condition, whereas for test Y, the
LOGIST-based mean true scores overestimate the simulated mean
true scores. The bias over test forms and conditions ranges from
approximately 7% of a pooled standard deviation for test Y in
condition III to 17% for test Y in condition II.

The variability is, again, quite homogeneous, and the
correlations are consistently good over test forms and
conditions, ranging from .902 to .931 (see Table 2).

Discussion 

The method of simultaneous estimation or concurrent
calibration of person and item parameters using LOGIST has been
discussed by Pollack (1985), Petersen et al. (1983), and Hambleton
and Swaminathan (1985) in the context of anchor test designs
where a common block of items exists and where the comparison
groups of examinees may not overlap. Concurrent calibration is
used to link items or to equate tests. For this reason, the three
conditions reported here involved varying the magnitude of the
common block of items from 75% in Condition I, to 50% in
Condition II, to no common block (all items unique) in Condition
III. However, the similarity of the reported statistics over all
three conditions for tests taken, as well as not taken, suggests
that the equivalence (within sampling error) of the two groups
sampled from the same population overrides the necessity of
equating through a common block of items; the "conditions"
reported here are better thought of as occasions for replication
of results comparing a particular test form's simulated thetas
and true scores with the corresponding estimates.

To assess the impact of varying the size of the common
block of items where the comparison groups are not equivalent,
the input dataset for LOGIST corresponding to the response matrix
U would have to be modified for each condition so that the shared
items occur only once. Presently, the two tests are adjoined. In
this study, varying the common block size merely introduced
varying degrees of test form comparability.

The correlational results were initially surprising inasmuch
as varying the common block size or comparability of test forms
made no difference: the correlations between simulated and
LOGIST-based true scores remained consistently good over all
conditions. This was puzzling, for it was predicted that the
correlations for the hypothetical tests not taken would be
especially poor in the third condition (no common block).
However, these results may be explained by the fact that the two
groups of simulated examinees were representative samples of a
normal distribution with mean equal to zero and variance equal to
one. A check on the distributions of theta for groups I and II
confirmed that while there were minor sample fluctuations, each
group in the simulation phase was representative of the
population.

To better understand the impact of equivalent groups in the
present study, recall that the number-right true score for a
given examinee is a sum of the probabilities (p-values) of
responding correctly to each item on a test. A p-value is a
function of the estimated theta for that person as well as the
estimated item parameters for a particular item. We have, as a
consequence of the present design:

$$\xi_{gk} = \sum_{i} P_i(\theta_{gk}) \tag{3}$$

and

$$P_i(\theta_{gk}) = f(\theta_{gk}, a_{gi}, b_{gi}, c_{gi}) \tag{4}$$

where $\xi_{gk}$ = number-right true score for the kth examinee
            in the gth group and
      $P_i(\theta_{gk})$ = the probability of responding correctly to the
            ith item given the kth examinee in group g.
In the case where a true score was computed for an examinee,
say, in Group I on test X, a test taken for this group and coded
as such in a LOGIST run, we have p-values of the form:

$$P_i(\theta_{\mathrm{I}k}) = f(\theta_{\mathrm{I}k}, a_{\mathrm{I}i}, b_{\mathrm{I}i}, c_{\mathrm{I}i}). \tag{5}$$

Note that the p-value in equation (5) is a function of item
parameters calibrated on the same group from which the theta
estimate is derived. However, in the case where a true score was
computed for an examinee, say, in Group I on Test Y, a
hypothetical test not taken for this group, we have p-values of
the form:

$$P_i(\theta_{\mathrm{I}k}) = f(\theta_{\mathrm{I}k}, a_{\mathrm{II}i}, b_{\mathrm{II}i}, c_{\mathrm{II}i}). \tag{6}$$

In equation (6) we see that the item parameters are not
calibrated on the same group from which the examinee came.

If the items are calibrated on a group of examinees that is
not quite equivalent to the comparison group, and if the item
parameters are similarly distributed over both tests, then the
estimated mean true score difference where one of the groups has
not actually taken the test will probably be a somewhat biased
estimate of the mean group difference had both groups taken the
test. To the extent that groups of examinees are equivalent and
representative samples of some population, and the items of the
tests are sampled from the same unidimensional domain, bias
should be eliminated since estimates of theta are invariant with
respect to the particular subset of items answered; similarly,
estimates of item parameters are invariant to the particular
subgroup or sample of examinees. In other words, as long as the
complete latent space is correctly specified, bias should be
effectively zero.

The results reported in Table 2 suggest that the bias is
quite small for tests taken since examinee true scores were based
on items calibrated on the same group from which examinees came.
We may infer that the LOGIST estimates of b's and thetas were
quite faithful (within a scale transformation) to the simulated
parameters when a's and c's were fixed. Note also that the
samples of the present study were very small: 100 simulated
examinees for each of two tests, and 20 items per test.

In conclusion, the modest overall bias reported here for
what could be considered a "best case" scenario suggests that
until the effects of using non-equivalent samples, and/or using
sets of item parameters which are discrepant over the tests of
interest are known, the comparison of cohorts in the manner
described above should remain experimental. In particular, the
method should not be used for assessing cohort differences until
the likely direction and magnitude of bias given the constraints
of any specific study are well understood.
Figure 1

Partitioned Matrix U of Simulated
Examinees by Item Responses

[Figure not reproduced. Rows: Group I (examinees 1-100) and
Group II (examinees 101-200). The common block of items is marked
among the item columns.]
Figure 2
Design of a LOGIST Input Dataset

                                   test X            test Y
                                   (items 1-20)      (items 21-40)

Group I  (examinees 1-100)         1's and 0's       3's

Group II (examinees 101-200)       3's               1's and 0's
Figure 3

Corresponding Blocks of True Scores for Tests Taken and
for Tests Not Taken Based on LOGIST Estimates
and on Simulated Values

LOGIST phase (scores are functions of LOGIST estimates):

                test X          test Y
Group I         TXL             TYL AS IF
Group II        TXL AS IF       TYL

Simulation phase (scores are functions of simulation parameter
values):

                test X          test Y
Group I         TXI             TYI
Group II        TXII            TYII

Tests taken: Group I on test X and Group II on test Y.
Tests NOT taken: Group I on test Y and Group II on test X.

where

TXL       = true scores on test X, Group I
TXL AS IF = hypothetical true scores on test X, Group II
TYL       = true scores on test Y, Group II
TYL AS IF = hypothetical true scores on test Y, Group I
TXI       = true scores on test X for Group I
TXII      = true scores on test X for Group II
TYI       = true scores on test Y for Group I
TYII      = true scores on test Y for Group II
Table 1

LOGIST Estimates of Theta vs Simulated Thetas

Condition    r(theta LOGIST,    Limits of        Limits of LOGIST
             theta simuln)      simulated        estimates of
                                thetas           theta

I            .894               -2.59, 2.60      -3.00, 1.68
II           .881               -2.48, 2.54      -3.00, 1.84
III          .868               -2.48, 2.76      -3.00, 1.74

Note: The missing LOGIST estimates for examinees with perfect
      scores were later coded as 1.8 in Conditions I and III,
      and 1.9 in Condition II. These values were used in the
      computation of the correlations reported here.
Table 2
True Score Comparisons

Tests Taken

Condition                  I          II         III

Relative size of
common block of items      75%        50%        none

(TX-TXL)/SDp               .002       -.033      .026
(TY-TYL)/SDp               -.024      .010       .054
SD TX/SD TXL               [illegible]
SD TY/SD TYL               [illegible]
r (TX, TXL)                [illegible]
r (TY, TYL)                .903       .905       .904


Tests NOT Taken

Condition                  I          II         III

(TX-TXL)/SDp               .127       .118       .152
(TY-TYL)/SDp               -.128      -.171      -.067
SD TX/SD TXL               .941       .974       .889
SD TY/SD TYL               .936       1.031      .954
r (TX, TXL)                .904       .904       .902
r (TY, TYL)                .931       .910       .914

Note: TX  = simulated true scores on test X
      TY  = simulated true scores on test Y
      TXL = LOGIST-based true scores on test X
      TYL = LOGIST-based true scores on test Y
      SDp = pooled estimate of the standard deviation
      r   = Pearson correlation coefficient